HDFS

Before talking about HDFS, let me tell you what a Distributed File System is.
A Distributed File System (DFS) manages data, i.e. files and folders, across multiple computers or servers. In other words, a DFS is a file system that lets us store data over multiple nodes or machines in a cluster and allows multiple users to access it. The only difference from a regular file system is that data is stored on multiple machines rather than a single machine. Even though the files are spread across the network, the DFS organizes and presents the data in such a manner that a user sitting at one machine feels as if all the data were stored locally.

HDFS and Hadoop history
In 2006, Hadoop's originators contributed their work on HDFS and MapReduce to the Apache Software Foundation, and the software was widely adopted in big data analytics projects across a range of industries. In 2012, HDFS and Hadoop became available in Version 1.0, and the basic HDFS design has been continuously updated since its inception. With Version 2.0 of Hadoop in 2013, a general-purpose YARN resource manager was added, and MapReduce and HDFS were effectively decoupled; thereafter, Hadoop supported diverse data processing frameworks and file systems. While MapReduce was often replaced by Apache Spark, HDFS continued to be a prevalent file system for Hadoop.

After four alpha releases and one beta, Apache Hadoop 3.0.0 became generally available in December 2017, with HDFS enhancements supporting additional NameNodes, erasure coding facilities and greater data compression. At the same time, HDFS tooling, such as LinkedIn's open source Dr. Elephant and Dynamometer performance testing tools, has advanced to enable development of ever larger HDFS implementations.

The Hadoop Distributed File System (HDFS)
  • HDFS is the storage layer of Hadoop: a distributed, scalable file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
  • A file system that can store any type of data (text, ORC, Avro, Parquet, ...).
  • Provides inexpensive and reliable storage for massive amounts of data.
  • HDFS performs best with a "modest" number of large files.
  • Files in HDFS are "written once": random writes are not allowed, and appends are supported only in later Hadoop releases (a minimal write/read sketch in Java follows this list).
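To make the write-once, streaming-access model concrete, here is a minimal sketch using Hadoop's Java FileSystem API: it creates a file with a single streaming write, then reads it back. The NameNode address hdfs://localhost:9000 and the path /user/demo/hello.txt are assumptions for illustration only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical path

        // "Write once": create() opens a brand-new file for a single streaming write.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // "Read many": open() streams the stored blocks back sequentially.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}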
 HDFS Features
  • HDFS is a file system written in Java.
  • Sits on top of the native file system.
  • Scalable and distributed storage.
  • Fault tolerant and designed to run on low-cost hardware.
  • Streaming data access.
  • Write once, read many model.
  • Supports storage of very large datasets.
  • Supports efficient processing with MapReduce, Spark and other frameworks (see the Spark sketch after this list).
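As a sketch of that last point, the snippet below uses Spark's Java API to count the lines of a file stored in HDFS. The local[*] master setting, the NameNode address and the file path are assumptions for illustration; on a real cluster these come from the cluster configuration and spark-submit.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsSparkSketch {
    public static void main(String[] args) {
        // Hypothetical local setup; on a real cluster the master comes from spark-submit.
        SparkConf conf = new SparkConf().setAppName("HdfsLineCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Spark reads HDFS paths natively; each HDFS block becomes an input partition.
            long lines = sc.textFile("hdfs://localhost:9000/user/demo/hello.txt").count();
            System.out.println("Line count: " + lines);
        }
    }
}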


  Difference between a regular file system and HDFS
  1. Regular file system: Data is maintained on a single machine. If the machine crashes, data recovery is difficult because fault tolerance is low. Seek time is also higher, so processing the data takes more time.
  2. HDFS: Data is distributed and maintained across multiple machines. If a DataNode crashes, the data can still be recovered from replicas on other nodes in the cluster (the sketch below shows how to inspect those replicas). Reading a single record can take comparatively longer, since data is read from local disks on several machines and coordinated over the network, but throughput on large sequential reads is far higher.
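As a minimal sketch of how replication underpins that recovery, the snippet below asks the NameNode for a file's replication factor, block size and the DataNodes hosting each block, again via Hadoop's Java FileSystem API. The NameNode address and file path are the same illustrative assumptions as in the earlier example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // Each file is split into blocks; each block is replicated across DataNodes.
        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());

        // List the DataNodes holding replicas of each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block hosts: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}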
